Alignment (HiSAT2) and counting (featureCounts) done in atmosphere
Will use DESeq2 and other R packages to peek at DEG.
# tell R where to do this work
# navigate to your project directory
setwd(".")
# create a new directory for DESeq2 output
dir.create("DESeq2")
Warning message in dir.create("DESeq2"):
“'DESeq2' already exists”
# load the DESeq2 library -- with return error if not properly installed
library(DESeq2)
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:parallel’:
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, append, as.data.frame, cbind, colMeans, colnames,
colSums, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
grepl, intersect, is.unsorted, lapply, lengths, Map, mapply, match,
mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
rbind, Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
table, tapply, union, unique, unsplit, which, which.max, which.min
Attaching package: ‘S4Vectors’
The following object is masked from ‘package:base’:
expand.grid
Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: DelayedArray
Loading required package: matrixStats
Warning message:
“package ‘matrixStats’ was built under R version 3.4.4”
Attaching package: ‘matrixStats’
The following objects are masked from ‘package:Biobase’:
anyMissing, rowMedians
Attaching package: ‘DelayedArray’
The following objects are masked from ‘package:matrixStats’:
colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
The following object is masked from ‘package:base’:
apply
# create a new variable that explains where our count data is
directory <- "./featureCounts_precalculated/"
# check the datafiles that are available is this directory
dir (directory)
To conduct our statistical analysis,
we will need to explain the factors related to each file.
For your own experiments, make sure to keep this data organized and safe.
Here is the information we will need:
| filename | genotype | Lgn | nod |
|---|---|---|---|
| dbl19_S46.counts.txt | dbl | Lgn | nod |
| dbl28_S48.counts.txt | dbl | Lgn | nod |
| dbl31_S41.counts.txt | dbl | Lgn | nod |
| dbl50_S43.counts.txt | dbl | Lgn | nod |
| dbl54_S44.counts.txt | dbl | Lgn | nod |
| dbl56_S50.counts.txt | dbl | Lgn | nod |
| dbl57_S52.counts.txt | dbl | Lgn | nod |
| dbl65_S42.counts.txt | dbl | Lgn | nod |
| dbl66_S49.counts.txt | dbl | Lgn | nod |
| dbl70_S47.counts.txt | dbl | Lgn | nod |
| dbl88_S51.counts.txt | dbl | Lgn | nod |
| dbl89_S40.counts.txt | dbl | Lgn | nod |
| dbl9_S45.counts.txt | dbl | Lgn | nod |
| lgn17_S35.counts.txt | Lgn | Lgn | WT |
| lgn21_S36.counts.txt | Lgn | Lgn | WT |
| lgn28_S37.counts.txt | Lgn | Lgn | WT |
| lgn3_S39.counts.txt | Lgn | Lgn | WT |
| lgn33_S38.counts.txt | Lgn | Lgn | WT |
| lgn37_S33.counts.txt | Lgn | Lgn | WT |
| lgn39_S27.counts.txt | Lgn | Lgn | WT |
| lgn42_S34.counts.txt | Lgn | Lgn | WT |
| lgn46_S30.counts.txt | Lgn | Lgn | WT |
| lgn59_S28.counts.txt | Lgn | Lgn | WT |
| lgn72_S31.counts.txt | Lgn | Lgn | WT |
| lgn77_S29.counts.txt | Lgn | Lgn | WT |
| lgn8_S32.counts.txt | Lgn | Lgn | WT |
| nod11_S24.counts.txt | nod | WT | nod |
| nod14_S18.counts.txt | nod | WT | nod |
| nod16_S15.counts.txt | nod | WT | nod |
| nod2_S26.counts.txt | nod | WT | nod |
| nod30_S17.counts.txt | nod | WT | nod |
| nod36_S21.counts.txt | nod | WT | nod |
| nod43_S22.counts.txt | nod | WT | nod |
| nod47_S23.counts.txt | nod | WT | nod |
| nod48_S16.counts.txt | nod | WT | nod |
| nod6_S20.counts.txt | nod | WT | nod |
| nod60_S19.counts.txt | nod | WT | nod |
| nod61_S25.counts.txt | nod | WT | nod |
| nod62_S14.counts.txt | nod | WT | nod |
| WT10_S8.counts.txt | WT | WT | WT |
| WT13_S12.counts.txt | WT | WT | WT |
| WT18_S4.counts.txt | WT | WT | WT |
| WT24_S10.counts.txt | WT | WT | WT |
| WT25_S7.counts.txt | WT | WT | WT |
| WT38_S5.counts.txt | WT | WT | WT |
| WT41_S6.counts.txt | WT | WT | WT |
| WT55_S9.counts.txt | WT | WT | WT |
| WT58_S3.counts.txt | WT | WT | WT |
| WT78_S2.counts.txt | WT | WT | WT |
| WT93_S11.counts.txt | WT | WT | WT |
| WT95_S13.counts.txt | WT | WT | WT |
| WT97_S1.counts.txt | WT | WT | WT |
# Import table info using read.table command
# sample factors are saved in a dataframe called sampleTable
# Note: We could save this info as a text file and import from that file,
# but I am using the direct text import because it is easy to read
sampleTable <- read.table(text="filename genotype Lgn nod
dbl19_S46.counts.txt dbl Lgn nod
dbl28_S48.counts.txt dbl Lgn nod
dbl31_S41.counts.txt dbl Lgn nod
dbl50_S43.counts.txt dbl Lgn nod
dbl54_S44.counts.txt dbl Lgn nod
dbl56_S50.counts.txt dbl Lgn nod
dbl57_S52.counts.txt dbl Lgn nod
dbl65_S42.counts.txt dbl Lgn nod
dbl66_S49.counts.txt dbl Lgn nod
dbl70_S47.counts.txt dbl Lgn nod
dbl88_S51.counts.txt dbl Lgn nod
dbl89_S40.counts.txt dbl Lgn nod
dbl9_S45.counts.txt dbl Lgn nod
lgn17_S35.counts.txt Lgn Lgn WT
lgn21_S36.counts.txt Lgn Lgn WT
lgn28_S37.counts.txt Lgn Lgn WT
lgn3_S39.counts.txt Lgn Lgn WT
lgn33_S38.counts.txt Lgn Lgn WT
lgn37_S33.counts.txt Lgn Lgn WT
lgn39_S27.counts.txt Lgn Lgn WT
lgn42_S34.counts.txt Lgn Lgn WT
lgn46_S30.counts.txt Lgn Lgn WT
lgn59_S28.counts.txt Lgn Lgn WT
lgn72_S31.counts.txt Lgn Lgn WT
lgn77_S29.counts.txt Lgn Lgn WT
lgn8_S32.counts.txt Lgn Lgn WT
nod11_S24.counts.txt nod WT nod
nod14_S18.counts.txt nod WT nod
nod16_S15.counts.txt nod WT nod
nod2_S26.counts.txt nod WT nod
nod30_S17.counts.txt nod WT nod
nod36_S21.counts.txt nod WT nod
nod43_S22.counts.txt nod WT nod
nod47_S23.counts.txt nod WT nod
nod48_S16.counts.txt nod WT nod
nod6_S20.counts.txt nod WT nod
nod60_S19.counts.txt nod WT nod
nod61_S25.counts.txt nod WT nod
nod62_S14.counts.txt nod WT nod
WT10_S8.counts.txt WT WT WT
WT13_S12.counts.txt WT WT WT
WT18_S4.counts.txt WT WT WT
WT24_S10.counts.txt WT WT WT
WT25_S7.counts.txt WT WT WT
WT38_S5.counts.txt WT WT WT
WT41_S6.counts.txt WT WT WT
WT55_S9.counts.txt WT WT WT
WT58_S3.counts.txt WT WT WT
WT78_S2.counts.txt WT WT WT
WT93_S11.counts.txt WT WT WT
WT95_S13.counts.txt WT WT WT
WT97_S1.counts.txt WT WT WT
", sep="\t", header=T)
# R is good at comparing factors against a reference value
# in this case, our reference value is WT
# Here, we use relevel() to explicitly say that 'WT" is the reference
# relevel sampleTable
sampleTable$genotype <- relevel(sampleTable$genotype, "WT")
sampleTable$Lgn <- relevel(sampleTable$Lgn, "WT")
sampleTable$nod <- relevel(sampleTable$nod, "WT")
# tell us about sampleTable
summary(sampleTable)
filename genotype Lgn nod dbl19_S46.counts.txt: 1 WT :13 WT :26 WT :26 dbl28_S48.counts.txt: 1 dbl:13 Lgn:26 nod:26 dbl31_S41.counts.txt: 1 Lgn:13 dbl50_S43.counts.txt: 1 nod:13 dbl54_S44.counts.txt: 1 dbl56_S50.counts.txt: 1 (Other) :46
Read counting softwares will produce tables with a slightly different format.
Make sure to check out your output before calculating.
# create file name and path lists to find our files
# uses the directory variable defined above
file_names <- list.files(path=directory, pattern = "*.txt$") # create a list of all txt files in the "featureCounts" folder, $ means that it must end in '.txt"
file_paths <- list.files(path=directory, pattern = "*.txt$", full.names=T) # create a list of all txt files' full locations in the "featureCounts" folder
# read one file for the header and row name information
featureCounts <- read.table(file_paths[1], sep="\t", header=T) # start by reading the first file
# use cbind to add each column afterwards
counts <- cbind(featureCounts[,7]) # take the file's 7th column (RNA counts in featureCounts format)
rownames(counts) <- featureCounts[,1] # name by gene
colnames(counts) <- c(file_names[1]) # name the first column
head(counts) # check on the data
dim(featureCounts) # before -- raw table with many columns
dim(counts) # after -- just the count data
| dbl19_S46.counts.txt | |
|---|---|
| gene:GRMZM2G059865 | 1270 |
| gene:GRMZM5G888250 | 34 |
| gene:GRMZM2G093344 | 11 |
| gene:GRMZM2G093399 | 0 |
| gene:GRMZM5G809743 | 20 |
| gene:GRMZM5G833153 | 54 |
# "gene:GRMZM2G059865" how ugly! -- this is a leftover from the way our GFF was formated
# use the sub() function to cut out the "gene:" stuff
head( sub("gene:", "", rownames(counts)) ) #test that you can do it
#replace row names
rownames(counts) <- sub("gene:", "", rownames(counts))
head(counts)
| dbl19_S46.counts.txt | |
|---|---|
| GRMZM2G059865 | 1270 |
| GRMZM5G888250 | 34 |
| GRMZM2G093344 | 11 |
| GRMZM2G093399 | 0 |
| GRMZM5G809743 | 20 |
| GRMZM5G833153 | 54 |
# create a loop to add on all the count data
for (f in 2:length(file_paths)) { # loop over all the rest of the files
new_featureCounts <- read.table(file_paths[f],sep="\t", header=T)
print ("Merging file...")
counts <- cbind(counts, new_featureCounts[,7]) # add on the new column
colnames(counts) <- c(file_names[1:f]) # add on the new column's name
}
head(counts) # check on the data
[1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..." [1] "Merging file..."
| dbl19_S46.counts.txt | dbl28_S48.counts.txt | dbl31_S41.counts.txt | dbl50_S43.counts.txt | dbl54_S44.counts.txt | dbl56_S50.counts.txt | dbl57_S52.counts.txt | dbl65_S42.counts.txt | dbl66_S49.counts.txt | dbl70_S47.counts.txt | ⋯ | WT24_S10.counts.txt | WT25_S7.counts.txt | WT38_S5.counts.txt | WT41_S6.counts.txt | WT55_S9.counts.txt | WT58_S3.counts.txt | WT78_S2.counts.txt | WT93_S11.counts.txt | WT95_S13.counts.txt | WT97_S1.counts.txt | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GRMZM2G059865 | 1270 | 859 | 739 | 812 | 613 | 700 | 730 | 966 | 773 | 935 | ⋯ | 122 | 724 | 1042 | 953 | 960 | 780 | 1181 | 423 | 427 | 561 |
| GRMZM5G888250 | 34 | 11 | 16 | 22 | 10 | 25 | 8 | 24 | 30 | 16 | ⋯ | 5 | 27 | 25 | 26 | 16 | 24 | 32 | 6 | 16 | 17 |
| GRMZM2G093344 | 11 | 16 | 13 | 11 | 4 | 9 | 13 | 13 | 15 | 21 | ⋯ | 4 | 8 | 20 | 14 | 16 | 7 | 27 | 7 | 6 | 10 |
| GRMZM2G093399 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| GRMZM5G809743 | 20 | 34 | 20 | 16 | 17 | 20 | 15 | 48 | 28 | 20 | ⋯ | 14 | 13 | 21 | 11 | 21 | 9 | 28 | 65 | 25 | 42 |
| GRMZM5G833153 | 54 | 65 | 59 | 54 | 53 | 43 | 42 | 85 | 58 | 59 | ⋯ | 50 | 43 | 78 | 32 | 38 | 28 | 57 | 180 | 92 | 127 |
DESeq2, like many packages uses a negative binomeal normalization method.
We can also input our experimental design to try and adjust for confounding factors.
We will use this method to also determine which genes are differentially expressed
#create dds object using matrix import command
ddsMatrix <- DESeqDataSetFromMatrix(countData=counts,
colData=sampleTable,
design=~genotype)
ddsMatrix
class: DESeqDataSet dim: 39625 52 metadata(1): version assays(1): counts rownames(39625): GRMZM2G059865 GRMZM5G888250 ... GRMZM6G036147 GRMZM6G708185 rowData names(0): colnames(52): dbl19_S46.counts.txt dbl28_S48.counts.txt ... WT95_S13.counts.txt WT97_S1.counts.txt colData names(4): filename genotype Lgn nod
# Filtering by gene
# lets get rid of genes that are not expressed or very hard to detect in this dataset
# there are other ways of considering them
# but having many genes with a low count value will affect our normalization method
ddsMatrix <- ddsMatrix[ rowSums(counts(ddsMatrix, normalized=FALSE) >= 5) > 13, ] #only consider detected genes, >= 1 RPM in more than 13 replicates
#how has it changed
ddsMatrix
class: DESeqDataSet dim: 22490 52 metadata(1): version assays(1): counts rownames(22490): GRMZM2G059865 GRMZM5G888250 ... GRMZM6G699895 GRMZM6G708185 rowData names(0): colnames(52): dbl19_S46.counts.txt dbl28_S48.counts.txt ... WT95_S13.counts.txt WT97_S1.counts.txt colData names(4): filename genotype Lgn nod
#convert into DESeq object -- will estimate dispersion
dds <- DESeq (ddsMatrix)
estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing -- replacing outliers and refitting for 72 genes -- DESeq argument 'minReplicatesForReplace' = 7 -- original counts are preserved in counts(dds) estimating dispersions fitting model and testing
plotDispEsts(dds, main="Dispersion plot")
We can use a PCA plot or other type of dimesional reduction method (MDS, tSNE, uMAP, etc)
to take a quick look at patterns accross datasets.
Ideally, your experimental design will be reflected in your PCA plot
But you might find outliers or other confounding variables have impacted your data.
Ex. If you are studying WT vs Mutant, a healthy datast will have two clusters
one for the WT transcriptomes and one for the Mutant transcriptomes
#create variance stabilized table as internal norm
# vsd is good for quick comparisons between datasets
# this method does not normalize by experimental factors
vsd <- varianceStabilizingTransformation(dds)
#check PCA plot at a few different levels
plotPCA(vsd, intgroup='filename', ntop=100000)
plotPCA(vsd, intgroup='filename', ntop=10000)
plotPCA(vsd, intgroup='filename', ntop=1000)
plotPCA(vsd, intgroup='filename', ntop=100)
plotPCA(vsd, intgroup='filename', ntop=10)
# Color by genotype
plotPCA(vsd, intgroup='genotype', ntop=10000)
plotPCA(vsd, intgroup='Lgn', ntop=10000)
plotPCA(vsd, intgroup='nod', ntop=10000)
# what about the expression pattern of individual genes
# can we find outliers here?
# from the published literature:
# we expect low nod expression in the nod mutant
# and high EcPR4 expression in the Lgn mutant
plotCounts(dds, gene="GRMZM2G134382", intgroup="genotype") #Lgn
plotCounts(dds, gene="GRMZM2G027821", intgroup="genotype") #nod
plotCounts(dds, gene="GRMZM2G075262", intgroup="genotype") #sol
plotCounts(dds, gene="GRMZM2G129189", intgroup="genotype") #EcPR4
# identify outliers
# let's do some plotting to label those outliers on the PCA plot
# load a package called ggplot
library(ggplot2)
# plot PCA then add a layer from ggplot that has the name of the sample
plotPCA(vsd, intgroup='genotype', ntop=10000) + geom_text(aes(label=name),vjust=2)
#outliers are
#nod11_S24.counts.txt
#nod2_S26.counts.txt
#nod61_S25.counts.txt
#WT13_S12.counts.txt
#WT24_S10.counts.txt
#WT95_S13.counts.txt
#WT97_S1.counts.txt
#you can use which to figure out which file index matches
# the number that returns can be used to quickly identify the column
which("nod11_S24.counts.txt" == colnames(dds))
which("nod2_S26.counts.txt" == colnames(dds))
which("nod61_S25.counts.txt" == colnames(dds))
which("WT13_S12.counts.txt" == colnames(dds))
which("WT24_S10.counts.txt" == colnames(dds))
which("WT95_S13.counts.txt" == colnames(dds))
which("WT97_S1.counts.txt" == colnames(dds))
# ok, we want to get rid of files 27,30,38,41,43,51,52
colnames(dds[,c(27,30,38,41,43,50,51,52)]) #check that the numbers have the right name
# drop outliers and check quality again
dds_drop <- dds[,-c(27,30,38,41,43,50,51,52)] #use '-' to remove
# compare the original and the cleaned data
dds
dds_drop
class: DESeqDataSet dim: 22490 52 metadata(1): version assays(5): counts mu cooks replaceCounts replaceCooks rownames(22490): GRMZM2G059865 GRMZM5G888250 ... GRMZM6G699895 GRMZM6G708185 rowData names(30): baseMean baseVar ... maxCooks replace colnames(52): dbl19_S46.counts.txt dbl28_S48.counts.txt ... WT95_S13.counts.txt WT97_S1.counts.txt colData names(6): filename genotype ... sizeFactor replaceable
class: DESeqDataSet dim: 22490 44 metadata(1): version assays(5): counts mu cooks replaceCounts replaceCooks rownames(22490): GRMZM2G059865 GRMZM5G888250 ... GRMZM6G699895 GRMZM6G708185 rowData names(30): baseMean baseVar ... maxCooks replace colnames(44): dbl19_S46.counts.txt dbl28_S48.counts.txt ... WT58_S3.counts.txt WT78_S2.counts.txt colData names(6): filename genotype ... sizeFactor replaceable
Perform the same quality control steps to check
Look at a PCA. Is the pattern the same?
#create variance stabilized table
vsd <- varianceStabilizingTransformation(dds_drop)
# check genotypes in PCA
plotPCA(vsd, intgroup='genotype', ntop=10000)
plotPCA(vsd, intgroup='Lgn', ntop=10000)
plotPCA(vsd, intgroup='nod', ntop=10000)
# re-check known genes
plotCounts(dds_drop, gene="GRMZM2G134382", intgroup="genotype") #Lgn
plotCounts(dds_drop, gene="GRMZM2G027821", intgroup="genotype") #nod
plotCounts(dds_drop, gene="GRMZM2G075262", intgroup="genotype") #sol
plotCounts(dds_drop, gene="GRMZM2G129189", intgroup="genotype") #EcPR4
What do you think?
Did we remove the right files?
### get normalized counts to calculate FPKM
## will normalize to FPKM using column 6 from featureCounts data
FPM <- fpm(dds_drop, robust=T) #use the DESeq2 fpm() command
head(counts) #compare raw counts
head(FPM) #to normalized FPMs
| dbl19_S46.counts.txt | dbl28_S48.counts.txt | dbl31_S41.counts.txt | dbl50_S43.counts.txt | dbl54_S44.counts.txt | dbl56_S50.counts.txt | dbl57_S52.counts.txt | dbl65_S42.counts.txt | dbl66_S49.counts.txt | dbl70_S47.counts.txt | ⋯ | WT24_S10.counts.txt | WT25_S7.counts.txt | WT38_S5.counts.txt | WT41_S6.counts.txt | WT55_S9.counts.txt | WT58_S3.counts.txt | WT78_S2.counts.txt | WT93_S11.counts.txt | WT95_S13.counts.txt | WT97_S1.counts.txt | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GRMZM2G059865 | 1270 | 859 | 739 | 812 | 613 | 700 | 730 | 966 | 773 | 935 | ⋯ | 122 | 724 | 1042 | 953 | 960 | 780 | 1181 | 423 | 427 | 561 |
| GRMZM5G888250 | 34 | 11 | 16 | 22 | 10 | 25 | 8 | 24 | 30 | 16 | ⋯ | 5 | 27 | 25 | 26 | 16 | 24 | 32 | 6 | 16 | 17 |
| GRMZM2G093344 | 11 | 16 | 13 | 11 | 4 | 9 | 13 | 13 | 15 | 21 | ⋯ | 4 | 8 | 20 | 14 | 16 | 7 | 27 | 7 | 6 | 10 |
| GRMZM2G093399 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| GRMZM5G809743 | 20 | 34 | 20 | 16 | 17 | 20 | 15 | 48 | 28 | 20 | ⋯ | 14 | 13 | 21 | 11 | 21 | 9 | 28 | 65 | 25 | 42 |
| GRMZM5G833153 | 54 | 65 | 59 | 54 | 53 | 43 | 42 | 85 | 58 | 59 | ⋯ | 50 | 43 | 78 | 32 | 38 | 28 | 57 | 180 | 92 | 127 |
| dbl19_S46.counts.txt | dbl28_S48.counts.txt | dbl31_S41.counts.txt | dbl50_S43.counts.txt | dbl54_S44.counts.txt | dbl56_S50.counts.txt | dbl57_S52.counts.txt | dbl65_S42.counts.txt | dbl66_S49.counts.txt | dbl70_S47.counts.txt | ⋯ | nod60_S19.counts.txt | nod62_S14.counts.txt | WT10_S8.counts.txt | WT18_S4.counts.txt | WT25_S7.counts.txt | WT38_S5.counts.txt | WT41_S6.counts.txt | WT55_S9.counts.txt | WT58_S3.counts.txt | WT78_S2.counts.txt | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GRMZM2G059865 | 177.964136 | 159.988289 | 139.809863 | 175.537471 | 147.3960146 | 140.119635 | 168.462592 | 158.824133 | 142.389210 | 152.347293 | ⋯ | 181.963289 | 178.2752629 | 240.178070 | 251.066188 | 223.177206 | 213.068324 | 222.007074 | 207.717729 | 227.616429 | 227.400099 |
| GRMZM5G888250 | 4.764394 | 2.048744 | 3.027007 | 4.755941 | 2.4045027 | 5.004273 | 1.846165 | 3.945941 | 5.526101 | 2.607012 | ⋯ | 3.676026 | 5.5518248 | 5.712612 | 7.251912 | 8.322907 | 5.112004 | 6.056856 | 3.461962 | 7.003582 | 6.161561 |
| GRMZM2G093344 | 1.541422 | 2.979991 | 2.459443 | 2.377971 | 0.9618011 | 1.801538 | 3.000019 | 2.137385 | 2.763051 | 3.421704 | ⋯ | 2.363160 | 2.1590430 | 1.986995 | 4.251121 | 2.466046 | 4.089603 | 3.261384 | 3.461962 | 2.042712 | 5.198817 |
| GRMZM5G809743 | 2.802585 | 6.332482 | 3.783758 | 3.458866 | 4.0876546 | 4.003418 | 3.461560 | 7.891882 | 5.157695 | 3.258766 | ⋯ | 6.564332 | 0.6168694 | 2.483744 | 8.002110 | 4.007326 | 4.294083 | 2.562516 | 4.543825 | 2.626343 | 5.391366 |
| GRMZM5G833153 | 7.566979 | 12.106215 | 11.162086 | 11.673674 | 12.7438642 | 8.607349 | 9.692368 | 13.975208 | 10.683796 | 9.613359 | ⋯ | 13.128664 | 5.8602595 | 10.928475 | 11.753098 | 13.255000 | 15.949452 | 7.454592 | 8.222160 | 8.170846 | 10.975280 |
| AC177838.2_FG015 | 3.363102 | 3.911239 | 4.351322 | 2.810329 | 3.6067540 | 1.801538 | 2.538477 | 4.603598 | 2.578847 | 4.399334 | ⋯ | 3.676026 | 1.8506083 | 4.222365 | 4.501187 | 3.699070 | 5.520964 | 2.562516 | 3.894707 | 2.334527 | 3.850975 |
# featureCounts saves transcript length in BP in column 6 of its output
# get transcript length from featureCounts column 6
featureCounts <- read.table(file_paths[1], sep="\t", header=T) # start by reading the first file
transcript_lengths <- cbind(featureCounts[,6]) # take the 6th column (bp lengths)
rownames(transcript_lengths) <- sub("gene:", "", featureCounts[,1]) # name by gene
head(transcript_lengths)
| GRMZM2G059865 | 2505 |
|---|---|
| GRMZM5G888250 | 506 |
| GRMZM2G093344 | 1012 |
| GRMZM2G093399 | 1087 |
| GRMZM5G809743 | 297 |
| GRMZM5G833153 | 690 |
# divide FPM by matching transcript lengths / 1000
FPM_rownames <- rownames(transcript_lengths) %in% rownames(FPM) #find rownames that are in the final FPM table
subset_transcripts <- transcript_lengths[FPM_rownames,] #take the transcript length
head(subset_transcripts)
# manually check some of the data
head ( FPM [1] / (subset_transcripts/1000) ) #try taking FPM column 1 and divide by transcript length
# looks pretty good, so lets use an apply() function
# runs the same calculation on all columns
FPKM <- apply(FPM, 2, function(column) column / (subset_transcripts / 1000) ) # apply function by column (,"2",)
head(FPKM) #check the output
| dbl19_S46.counts.txt | dbl28_S48.counts.txt | dbl31_S41.counts.txt | dbl50_S43.counts.txt | dbl54_S44.counts.txt | dbl56_S50.counts.txt | dbl57_S52.counts.txt | dbl65_S42.counts.txt | dbl66_S49.counts.txt | dbl70_S47.counts.txt | ⋯ | nod60_S19.counts.txt | nod62_S14.counts.txt | WT10_S8.counts.txt | WT18_S4.counts.txt | WT25_S7.counts.txt | WT38_S5.counts.txt | WT41_S6.counts.txt | WT55_S9.counts.txt | WT58_S3.counts.txt | WT78_S2.counts.txt | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GRMZM2G059865 | 71.043567 | 63.867580 | 55.812321 | 70.074839 | 58.8407244 | 55.935982 | 67.250536 | 63.402847 | 56.842000 | 60.817282 | ⋯ | 72.640036 | 71.167770 | 95.879469 | 100.226023 | 89.092697 | 85.05722 | 88.625578 | 82.921249 | 90.864842 | 90.778483 |
| GRMZM5G888250 | 9.415799 | 4.048901 | 5.982226 | 9.399094 | 4.7519816 | 9.889867 | 3.648548 | 7.798303 | 10.921149 | 5.152199 | ⋯ | 7.264874 | 10.971986 | 11.289747 | 14.331842 | 16.448432 | 10.10277 | 11.970071 | 6.841822 | 13.841072 | 12.176997 |
| GRMZM2G093344 | 1.523144 | 2.944656 | 2.430279 | 2.349773 | 0.9503963 | 1.780176 | 2.964445 | 2.112040 | 2.730287 | 3.381130 | ⋯ | 2.335138 | 2.133442 | 1.963434 | 4.200712 | 2.436805 | 4.04111 | 3.222712 | 3.420911 | 2.018490 | 5.137171 |
| GRMZM5G809743 | 9.436313 | 21.321487 | 12.739926 | 11.646015 | 13.7631467 | 13.479522 | 11.655085 | 26.571994 | 17.365975 | 10.972275 | ⋯ | 22.102129 | 2.077001 | 8.362775 | 26.943130 | 13.492678 | 14.45819 | 8.628000 | 15.299075 | 8.842907 | 18.152746 |
| GRMZM5G833153 | 10.966636 | 17.545239 | 16.176937 | 16.918368 | 18.4693684 | 12.474419 | 14.046911 | 20.253925 | 15.483762 | 13.932404 | ⋯ | 19.027050 | 8.493130 | 15.838369 | 17.033476 | 19.210145 | 23.11515 | 10.803757 | 11.916174 | 11.841806 | 15.906203 |
| AC177838.2_FG015 | 4.295149 | 4.995196 | 5.557244 | 3.589181 | 4.6063270 | 2.300815 | 3.241989 | 5.879436 | 3.293547 | 5.618561 | ⋯ | 4.694797 | 2.363484 | 5.392548 | 5.748642 | 4.724227 | 7.05104 | 3.272690 | 4.974084 | 2.981517 | 4.918232 |
# should change col names to reflect that these are not counts
sub(".counts.txt", "", colnames(FPKM)) #test that you can do it
#replace column names
colnames(FPKM) <- sub(".counts.txt", "", colnames(FPKM))
head(FPKM)
| dbl19_S46 | dbl28_S48 | dbl31_S41 | dbl50_S43 | dbl54_S44 | dbl56_S50 | dbl57_S52 | dbl65_S42 | dbl66_S49 | dbl70_S47 | ⋯ | nod60_S19 | nod62_S14 | WT10_S8 | WT18_S4 | WT25_S7 | WT38_S5 | WT41_S6 | WT55_S9 | WT58_S3 | WT78_S2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GRMZM2G059865 | 71.043567 | 63.867580 | 55.812321 | 70.074839 | 58.8407244 | 55.935982 | 67.250536 | 63.402847 | 56.842000 | 60.817282 | ⋯ | 72.640036 | 71.167770 | 95.879469 | 100.226023 | 89.092697 | 85.05722 | 88.625578 | 82.921249 | 90.864842 | 90.778483 |
| GRMZM5G888250 | 9.415799 | 4.048901 | 5.982226 | 9.399094 | 4.7519816 | 9.889867 | 3.648548 | 7.798303 | 10.921149 | 5.152199 | ⋯ | 7.264874 | 10.971986 | 11.289747 | 14.331842 | 16.448432 | 10.10277 | 11.970071 | 6.841822 | 13.841072 | 12.176997 |
| GRMZM2G093344 | 1.523144 | 2.944656 | 2.430279 | 2.349773 | 0.9503963 | 1.780176 | 2.964445 | 2.112040 | 2.730287 | 3.381130 | ⋯ | 2.335138 | 2.133442 | 1.963434 | 4.200712 | 2.436805 | 4.04111 | 3.222712 | 3.420911 | 2.018490 | 5.137171 |
| GRMZM5G809743 | 9.436313 | 21.321487 | 12.739926 | 11.646015 | 13.7631467 | 13.479522 | 11.655085 | 26.571994 | 17.365975 | 10.972275 | ⋯ | 22.102129 | 2.077001 | 8.362775 | 26.943130 | 13.492678 | 14.45819 | 8.628000 | 15.299075 | 8.842907 | 18.152746 |
| GRMZM5G833153 | 10.966636 | 17.545239 | 16.176937 | 16.918368 | 18.4693684 | 12.474419 | 14.046911 | 20.253925 | 15.483762 | 13.932404 | ⋯ | 19.027050 | 8.493130 | 15.838369 | 17.033476 | 19.210145 | 23.11515 | 10.803757 | 11.916174 | 11.841806 | 15.906203 |
| AC177838.2_FG015 | 4.295149 | 4.995196 | 5.557244 | 3.589181 | 4.6063270 | 2.300815 | 3.241989 | 5.879436 | 3.293547 | 5.618561 | ⋯ | 4.694797 | 2.363484 | 5.392548 | 5.748642 | 4.724227 | 7.05104 | 3.272690 | 4.974084 | 2.981517 | 4.918232 |
#write FPKM to file
#export as TSV
write.table(FPKM,
file="./DESeq2/FPKM_outlierdrop_normalized.13s.5c.txt",
sep="\t", quote=F)
Determine differentially expressed genes and compare between variables
1) Lgn vs wt against nod vs wt
2) double vs wt against Lgn vs wt against nod vs wt
what do the comparisons mean?
sampleTable$genotype
# 1) how does Lgn vs wt compare with nod vs wt
LgnvWT <- results(dds_drop, contrast=c("genotype", "Lgn", "WT"), alpha=0.05) #calculates Lgn / WT
nodvWT <- results(dds_drop, contrast=c("genotype", "nod", "WT"), alpha=0.05) #calculates nod / WT
#export as TSV
write.table(as.data.frame(LgnvWT),
file="./DESeq2/LgnvWT.RMoutliers.13S.5C.txt",
sep="\t", quote=F)
#export as TSV
write.table(as.data.frame(nodvWT),
file="./DESeq2/nodvWT.RMoutliers.13S.5C.txt",
sep="\t", quote=F)
# general features of DEG
summary(LgnvWT)
plotMA(LgnvWT,ylim=c(-10,10))
out of 22490 with nonzero total read count adjusted p-value < 0.05 LFC > 0 (up) : 2293, 10% LFC < 0 (down) : 3981, 18% outliers [1] : 0, 0% low counts [2] : 0, 0% (mean count < 2) [1] see 'cooksCutoff' argument of ?results [2] see 'independentFiltering' argument of ?results
# general features of DEG
summary(nodvWT)
plotMA(nodvWT,ylim=c(-10,10))
out of 22490 with nonzero total read count adjusted p-value < 0.05 LFC > 0 (up) : 933, 4.1% LFC < 0 (down) : 395, 1.8% outliers [1] : 0, 0% low counts [2] : 0, 0% (mean count < 2) [1] see 'cooksCutoff' argument of ?results [2] see 'independentFiltering' argument of ?results
# collect the DE transcript names
LgnWT_DEG <- rownames ( subset(LgnvWT, padj <= 0.05) )
# how many DEGs?
length(LgnWT_DEG)
nodWT_DEG <- rownames ( subset(nodvWT, padj <= 0.05) )
# how many DEGs?
length(nodWT_DEG)
# simply check the intersection of the DEG names
intersect(LgnWT_DEG, nodWT_DEG)
# what is the length of the intersection?
length(intersect(LgnWT_DEG, nodWT_DEG))
# use the VennDiagram package to make quick venn diagrams
library(VennDiagram)
# use draw.pairwise.venn() to compare two lists with their intersection
# circle area represents actual list size
draw.pairwise.venn(length(LgnWT_DEG),
length(nodWT_DEG),
length(intersect(LgnWT_DEG, nodWT_DEG)),
category = c("Lgn1\nvs\nWT", "nod\nvs\nWT"),
euler.d=T)
Warning message: “package ‘VennDiagram’ was built under R version 3.4.4”Loading required package: grid Loading required package: futile.logger
(polygon[GRID.polygon.2566], polygon[GRID.polygon.2567], polygon[GRID.polygon.2568], polygon[GRID.polygon.2569], text[GRID.text.2570], text[GRID.text.2571], text[GRID.text.2572], text[GRID.text.2573], text[GRID.text.2574])
# make a heatmap of intersect transcripts
# are differentially-expressed genes regulated in the same direction
# or unique to each?
#try some heatmapping
# my favorite package for heatmaps is called pheatmap
library("pheatmap")
#pull out subset of FPKM data that is DE in both Lgn v wt and nod v wt
plot_data <- FPKM[intersect(LgnWT_DEG, nodWT_DEG),]
# rotate and scale FPKM values so every genes varies along the same scale
#make a heatmap
pheatmap(t(scale(t(plot_data))), cluster_rows=FALSE,
show_rownames=FALSE, show_colnames=TRUE,
cluster_cols=FALSE, annotation_col=NULL)
Warning message: “package ‘pheatmap’ was built under R version 3.4.4”
# quite chaotic, huh?
# we can cluster the genes and add margin annotations to improve the visual
#pull out annotation information from DESeq object
df <- as.data.frame(colData(dds_drop)[,c("Lgn","nod")])
#cluster samples and add annotation
pheatmap(t(scale(t(plot_data))), cluster_rows=TRUE,
show_rownames=FALSE, show_colnames=TRUE,
cluster_cols=TRUE, annotation_col=df)
#leave samples in order and add annotation
pheatmap(t(scale(t(plot_data))), cluster_rows=TRUE,
show_rownames=FALSE, show_colnames=TRUE,
cluster_cols=FALSE, annotation_col=df)
# 2) how does double vs WT compare with Lgn vs WT compare with nod vs WT
dblvWT <- results(dds_drop, contrast=c("genotype", "dbl", "WT"), alpha=0.05)
LgnvWT <- results(dds_drop, contrast=c("genotype", "Lgn", "WT"), alpha=0.05)
nodvWT <- results(dds_drop, contrast=c("genotype", "nod", "WT"), alpha=0.05)
#export as TSV
write.table(as.data.frame(dblvWT),
file="./DESeq2/doublevWT.RMoutliers.13S.5C.txt",
sep="\t", quote=F)
summary(dblvWT)
plotMA(dblvWT,ylim=c(-10,10))
out of 22490 with nonzero total read count adjusted p-value < 0.05 LFC > 0 (up) : 4182, 19% LFC < 0 (down) : 4955, 22% outliers [1] : 0, 0% low counts [2] : 0, 0% (mean count < 2) [1] see 'cooksCutoff' argument of ?results [2] see 'independentFiltering' argument of ?results
# collect the DE transcript names
dblWT_DEG <- rownames ( subset(dblvWT, padj <= 0.05) )
length(dblWT_DEG)
LgnWT_DEG <- rownames ( subset(LgnvWT, padj <= 0.05) )
length(LgnWT_DEG)
nodWT_DEG <- rownames ( subset(nodvWT, padj <= 0.05) )
length(nodWT_DEG)
#find the intersection
intersect(dblWT_DEG, LgnWT_DEG)
length(intersect(dblWT_DEG, LgnWT_DEG))
#find the intersection
intersect(dblWT_DEG, nodWT_DEG)
length(intersect(dblWT_DEG, nodWT_DEG))
#find the intersection
intersect(LgnWT_DEG, nodWT_DEG)
length(intersect(LgnWT_DEG, nodWT_DEG))
#intersect the intersects
intersect(intersect(dblWT_DEG, LgnWT_DEG), nodWT_DEG)
length(intersect(intersect(dblWT_DEG, LgnWT_DEG), nodWT_DEG))
#make a three-way venndiagram
library(VennDiagram)
overrideTriple=T #required by VennDiagram to scale area to size of list
draw.triple.venn( length(dblWT_DEG),
length(LgnWT_DEG),
length(nodWT_DEG),
length(intersect(dblWT_DEG, LgnWT_DEG)),
length(intersect(LgnWT_DEG, nodWT_DEG)),
length(intersect(dblWT_DEG, nodWT_DEG)),
length(intersect(intersect(dblWT_DEG, LgnWT_DEG), nodWT_DEG)),
category = c("dbl\nvs\nWT","Lgn1\nvs\nWT", "nod\nvs\nWT"),
euler.d=T, scaled=T)
(polygon[GRID.polygon.2627], polygon[GRID.polygon.2628], polygon[GRID.polygon.2629], polygon[GRID.polygon.2630], polygon[GRID.polygon.2631], polygon[GRID.polygon.2632], text[GRID.text.2633], text[GRID.text.2634], text[GRID.text.2635], text[GRID.text.2636], text[GRID.text.2637], text[GRID.text.2638], text[GRID.text.2639], text[GRID.text.2640], text[GRID.text.2641], text[GRID.text.2642])
# heatmap of intersect transcripts
# are genes co-differential, or unique to each
library("pheatmap")
# get plot data from the overall FPKM
plot_data <- FPKM[intersect(intersect(dblWT_DEG, LgnWT_DEG), nodWT_DEG),]
df <- as.data.frame(colData(dds_drop)[,c("Lgn","nod")])
pheatmap(t(scale(t(plot_data))), cluster_rows=TRUE,
show_rownames=FALSE, show_colnames=TRUE,
cluster_cols=TRUE, annotation_col=df)
pheatmap(t(scale(t(plot_data))), cluster_rows=TRUE,
show_rownames=FALSE, show_colnames=TRUE,
cluster_cols=FALSE, annotation_col=df)
What other comparisons can we make?
What else can we do to make sense of these DEG?
# what genes are DEG v WT in all
length(intersect(intersect(nodWT_DEG, dblWT_DEG), LgnWT_DEG))
write.table(intersect(intersect(nodWT_DEG, dblWT_DEG), LgnWT_DEG),
file = "./DESeq2/dblLgnnod_shared_genes.RMoutliers.13S.5C.txt",
quote = F, sep = "\t")
# what genes are unique to double?
length(setdiff(setdiff(dblWT_DEG, LgnWT_DEG), nodWT_DEG))
write.table(x = setdiff(setdiff(dblWT_DEG, LgnWT_DEG), nodWT_DEG),
file = "./DESeq2/dbl_unique_genes.RMoutliers.13S.5C.txt",
quote = F, sep ="\t")
#what genes are unique to lgn and nod but not double?
length(setdiff(intersect(nodWT_DEG, LgnWT_DEG), dblWT_DEG))
write.table(x = setdiff(intersect(nodWT_DEG, LgnWT_DEG), dblWT_DEG),
file = "./DESeq2/Lgnnod_not_double_genes.13S.5C.txt",
quote = F, sep = "\t")
#what genes are unique to Lgn?
length(setdiff(setdiff(LgnWT_DEG, dblWT_DEG), nodWT_DEG))
write.table(x = setdiff(setdiff(LgnWT_DEG, dblWT_DEG), nodWT_DEG),
file = "./DESeq2/Lgn_unique_genes.RMoutliers.13S.5C.txt",
quote = F, sep ="\t")
#what genes are unique to nod?
length(setdiff(setdiff(nodWT_DEG, dblWT_DEG), LgnWT_DEG))
write.table(x = setdiff(setdiff(nodWT_DEG, dblWT_DEG), LgnWT_DEG),
file = "./DESeq2/nod_unique_genes.RMoutliers.13S.5C.txt",
quote = F, sep ="\t")
check out the tools at http://systemsbiology.cau.edu.cn/agriGOv2/
calculate the SEA -- gene set enrichment for each mutant's unique DEG
http://systemsbiology.cau.edu.cn/agriGOv2/classification_analysis.php?category=Plant&&family=Poaceae
Let's try some machine learning
We will make inference trees to classify mutant type by RNAseq
### try a random forest walk?
#did you bring your boots?
require(randomForest)
Loading required package: randomForest
Warning message:
“package ‘randomForest’ was built under R version 3.4.4”randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:ggplot2’:
margin
The following object is masked from ‘package:Biobase’:
combine
The following object is masked from ‘package:BiocGenerics’:
combine
# there are some random features to randomForest
# we can use a 'random seed' to make the random features reproducible
# you may use any seed you like, but for this example let's use the same one
set.seed(1234)
# lets try making a model
# we will predict FPKM (x)
# using genotype (y)
# and a small number of decision trees (ntree)
# x
head (FPKM)
| dbl19_S46 | dbl28_S48 | dbl31_S41 | dbl50_S43 | dbl54_S44 | dbl56_S50 | dbl57_S52 | dbl65_S42 | dbl66_S49 | dbl70_S47 | ⋯ | nod60_S19 | nod62_S14 | WT10_S8 | WT18_S4 | WT25_S7 | WT38_S5 | WT41_S6 | WT55_S9 | WT58_S3 | WT78_S2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GRMZM2G059865 | 71.043567 | 63.867580 | 55.812321 | 70.074839 | 58.8407244 | 55.935982 | 67.250536 | 63.402847 | 56.842000 | 60.817282 | ⋯ | 72.640036 | 71.167770 | 95.879469 | 100.226023 | 89.092697 | 85.05722 | 88.625578 | 82.921249 | 90.864842 | 90.778483 |
| GRMZM5G888250 | 9.415799 | 4.048901 | 5.982226 | 9.399094 | 4.7519816 | 9.889867 | 3.648548 | 7.798303 | 10.921149 | 5.152199 | ⋯ | 7.264874 | 10.971986 | 11.289747 | 14.331842 | 16.448432 | 10.10277 | 11.970071 | 6.841822 | 13.841072 | 12.176997 |
| GRMZM2G093344 | 1.523144 | 2.944656 | 2.430279 | 2.349773 | 0.9503963 | 1.780176 | 2.964445 | 2.112040 | 2.730287 | 3.381130 | ⋯ | 2.335138 | 2.133442 | 1.963434 | 4.200712 | 2.436805 | 4.04111 | 3.222712 | 3.420911 | 2.018490 | 5.137171 |
| GRMZM5G809743 | 9.436313 | 21.321487 | 12.739926 | 11.646015 | 13.7631467 | 13.479522 | 11.655085 | 26.571994 | 17.365975 | 10.972275 | ⋯ | 22.102129 | 2.077001 | 8.362775 | 26.943130 | 13.492678 | 14.45819 | 8.628000 | 15.299075 | 8.842907 | 18.152746 |
| GRMZM5G833153 | 10.966636 | 17.545239 | 16.176937 | 16.918368 | 18.4693684 | 12.474419 | 14.046911 | 20.253925 | 15.483762 | 13.932404 | ⋯ | 19.027050 | 8.493130 | 15.838369 | 17.033476 | 19.210145 | 23.11515 | 10.803757 | 11.916174 | 11.841806 | 15.906203 |
| AC177838.2_FG015 | 4.295149 | 4.995196 | 5.557244 | 3.589181 | 4.6063270 | 2.300815 | 3.241989 | 5.879436 | 3.293547 | 5.618561 | ⋯ | 4.694797 | 2.363484 | 5.392548 | 5.748642 | 4.724227 | 7.05104 | 3.272690 | 4.974084 | 2.981517 | 4.918232 |
# y
colData(dds_drop)$genotype
# lets try making a model
# we will predict FPKM (x)
# using genotype (y)
# and a small number of decision trees (ntree)
rf.10 <- randomForest(x=t(FPKM), y=colData(dds_drop)$genotype, ntree=10)
print(rf.10)
Call:
randomForest(x = t(FPKM), y = colData(dds_drop)$genotype, ntree = 10)
Type of random forest: classification
Number of trees: 10
No. of variables tried at each split: 149
OOB estimate of error rate: 11.63%
Confusion matrix:
WT dbl Lgn nod class.error
WT 7 0 1 0 0.1250000
dbl 0 11 2 0 0.1538462
Lgn 0 0 12 0 0.0000000
nod 1 0 1 8 0.2000000
# to create an effective model, we can emprically determine the optimum:
# 1) number of decision trees to use
# 2) number of paramaters (--genes in our case--) available to each tree
# test a range of tree numbers
rf.10 <- randomForest(x=t(FPKM), y=colData(dds_drop)$genotype, ntree=10)
rf.50 <- randomForest(x=t(FPKM), y=colData(dds_drop)$genotype, ntree=50)
rf.100 <- randomForest(x=t(FPKM), y=colData(dds_drop)$genotype, ntree=100)
rf.500 <- randomForest(x=t(FPKM), y=colData(dds_drop)$genotype, ntree=500)
rf.1000 <- randomForest(x=t(FPKM), y=colData(dds_drop)$genotype, ntree=1000)
print(rf.10)
print(rf.50)
print(rf.100)
print(rf.500)
print(rf.1000)
Call:
randomForest(x = t(FPKM), y = colData(dds_drop)$genotype, ntree = 10)
Type of random forest: classification
Number of trees: 10
No. of variables tried at each split: 149
OOB estimate of error rate: 30.95%
Confusion matrix:
WT dbl Lgn nod class.error
WT 5 0 1 2 0.3750000
dbl 0 10 2 0 0.1666667
Lgn 1 1 9 1 0.2500000
nod 1 2 2 5 0.5000000
Call:
randomForest(x = t(FPKM), y = colData(dds_drop)$genotype, ntree = 50, importance = T)
Type of random forest: classification
Number of trees: 50
No. of variables tried at each split: 149
OOB estimate of error rate: 6.82%
Confusion matrix:
WT dbl Lgn nod class.error
WT 8 0 0 0 0.0
dbl 0 13 0 0 0.0
Lgn 0 0 13 0 0.0
nod 2 0 1 7 0.3
Call:
randomForest(x = t(FPKM), y = colData(dds_drop)$genotype, ntree = 100)
Type of random forest: classification
Number of trees: 100
No. of variables tried at each split: 149
OOB estimate of error rate: 6.82%
Confusion matrix:
WT dbl Lgn nod class.error
WT 7 0 0 1 0.12500000
dbl 0 12 1 0 0.07692308
Lgn 0 0 13 0 0.00000000
nod 1 0 0 9 0.10000000
Call:
randomForest(x = t(FPKM), y = colData(dds_drop)$genotype, ntree = 500)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 149
OOB estimate of error rate: 2.27%
Confusion matrix:
WT dbl Lgn nod class.error
WT 7 0 0 1 0.125
dbl 0 13 0 0 0.000
Lgn 0 0 13 0 0.000
nod 0 0 0 10 0.000
Call:
randomForest(x = t(FPKM), y = colData(dds_drop)$genotype, ntree = 1000)
Type of random forest: classification
Number of trees: 1000
No. of variables tried at each split: 149
OOB estimate of error rate: 2.27%
Confusion matrix:
WT dbl Lgn nod class.error
WT 7 0 0 1 0.125
dbl 0 13 0 0 0.000
Lgn 0 0 13 0 0.000
nod 0 0 0 10 0.000
# 50 trees is looking pretty sweet
# can you find a lower tree number that does as well?
rf.10 <- randomForest(x=t(FPKM), y=colData(dds_drop)$genotype, ntree=10)
rf.20 <- randomForest(x=t(FPKM), y=colData(dds_drop)$genotype, ntree=20)
rf.30 <- randomForest(x=t(FPKM), y=colData(dds_drop)$genotype, ntree=30)
rf.40 <- randomForest(x=t(FPKM), y=colData(dds_drop)$genotype, ntree=40)
rf.50 <- randomForest(x=t(FPKM), y=colData(dds_drop)$genotype, ntree=50)
print (rf.10)
print (rf.20)
print (rf.30)
print (rf.40)
print (rf.50)
Call:
randomForest(x = t(FPKM), y = colData(dds_drop)$genotype, ntree = 10)
Type of random forest: classification
Number of trees: 10
No. of variables tried at each split: 149
OOB estimate of error rate: 30.95%
Confusion matrix:
WT dbl Lgn nod class.error
WT 5 0 1 2 0.3750000
dbl 0 10 2 0 0.1666667
Lgn 1 1 9 1 0.2500000
nod 1 2 2 5 0.5000000
Call:
randomForest(x = t(FPKM), y = colData(dds_drop)$genotype, ntree = 20)
Type of random forest: classification
Number of trees: 20
No. of variables tried at each split: 149
OOB estimate of error rate: 9.09%
Confusion matrix:
WT dbl Lgn nod class.error
WT 7 0 0 1 0.12500000
dbl 0 12 1 0 0.07692308
Lgn 0 1 12 0 0.07692308
nod 1 0 0 9 0.10000000
Call:
randomForest(x = t(FPKM), y = colData(dds_drop)$genotype, ntree = 30)
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 149
OOB estimate of error rate: 4.55%
Confusion matrix:
WT dbl Lgn nod class.error
WT 7 0 0 1 0.12500000
dbl 0 13 0 0 0.00000000
Lgn 0 0 12 1 0.07692308
nod 0 0 0 10 0.00000000
Call:
randomForest(x = t(FPKM), y = colData(dds_drop)$genotype, ntree = 40)
Type of random forest: classification
Number of trees: 40
No. of variables tried at each split: 149
OOB estimate of error rate: 2.27%
Confusion matrix:
WT dbl Lgn nod class.error
WT 7 0 0 1 0.125
dbl 0 13 0 0 0.000
Lgn 0 0 13 0 0.000
nod 0 0 0 10 0.000
Call:
randomForest(x = t(FPKM), y = colData(dds_drop)$genotype, ntree = 50)
Type of random forest: classification
Number of trees: 50
No. of variables tried at each split: 149
OOB estimate of error rate: 0%
Confusion matrix:
WT dbl Lgn nod class.error
WT 8 0 0 0 0
dbl 0 13 0 0 0
Lgn 0 0 13 0 0
nod 0 0 0 10 0
# now to refine the number of genes considered in each tree
# the randomForest package comes with a built in function that will try to optimize
# by trying mtry values higher or lower than the starting one, and looking for improvement
# when it can't improve by calculating a new mtry tree, it stops
mtry <- tuneRF(x = t(FPKM), y = colData(dds_drop)$genotype, ntreeTry=50, mtryStart=149,
stepFactor=5,improve=1, trace=TRUE, plot=TRUE) #increases/decreases by step factor and checks for improve %improvement
print(mtry)
mtry = 149 OOB error = 0%
Searching left ...
mtry = 30 OOB error = 13.64%
-Inf 1
Searching right ...
mtry = 745 OOB error = 2.27%
-Inf 1
mtry OOBError
30.OOB 30 0.13636364
149.OOB 149 0.00000000
745.OOB 745 0.02272727
# let's calculate the relative importance of all genes involved in making the decision tree
rf.50 <- randomForest(x=t(FPKM), y=colData(dds_drop)$genotype, ntree=50, importance=T)
print(rf.50)
Call:
randomForest(x = t(FPKM), y = colData(dds_drop)$genotype, ntree = 50, importance = T)
Type of random forest: classification
Number of trees: 50
No. of variables tried at each split: 149
OOB estimate of error rate: 6.82%
Confusion matrix:
WT dbl Lgn nod class.error
WT 8 0 0 0 0.0
dbl 0 13 0 0 0.0
Lgn 0 0 13 0 0.0
nod 2 0 1 7 0.3
# plot importances
varImpPlot(rf.50, n.var=25, main="Top 25 Influential Transcripts")
#extract influence metrics from random forest generation
influence <- importance(rf.50)
#order by error and output
head(influence[order(influence[,6], decreasing=T),])
influence.order <- influence[order(influence[,6], decreasing=T),]
write.table(influence.order,
file="./DESeq2/randomForest_importance.13S.5C.txt", sep="\t", quote=F)
| WT | dbl | Lgn | nod | MeanDecreaseAccuracy | MeanDecreaseGini | |
|---|---|---|---|---|---|---|
| GRMZM2G140231 | 1.010153 | 0.000000 | 1.3664626 | 1.010153 | 1.201540 | 0.4900369 |
| GRMZM2G087267 | 0.000000 | 0.000000 | 1.0101525 | 1.010153 | 1.010153 | 0.4860207 |
| GRMZM2G153536 | 1.366463 | 1.230915 | 1.2728909 | 1.285649 | 1.440046 | 0.4669986 |
| GRMZM2G052825 | 1.414214 | 0.000000 | 0.6350006 | 1.010153 | 1.174942 | 0.4345720 |
| GRMZM2G087625 | 1.010153 | 0.000000 | 1.4433757 | 1.443376 | 1.424208 | 0.4123665 |
| GRMZM2G064106 | 1.414214 | 0.000000 | 1.4433757 | 1.366463 | 1.422413 | 0.4004144 |
Please find 2-3 collaborators.
Name your group. Using the tools available to you,
prepare analysis, figures, tables to address the following hypothesis:
EVENS: Lgn and nod act in the same pathway
ODDS: Lgn and nod act in different pathways
Suggestions for additional anlysis:
? Use AgriGO2 to determine enrichment in unique and shared DEG lists
? Are any pathways over represented
? Look at the most informative genes from randomforest for each mutant
? How are informative genes expressed accross mutants?
Get creative!